Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compact: Apply retention before compaction #5766

Conversation

igorwwwwwwwwwwwwwwwwwwww
Copy link
Contributor

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

The current behaviour of the compactor results in lots of unnecessary work being done if there is a compaction backlog.

Description

AFAICT the compactor will keep compacting until the entire backlog is cleared. Only then will it start downsampling and only once downsampling is complete, will it apply retention.

This means that we spend a lot of work compacting blocks that may already be outside of the retention window. The progress calculator determines which blocks would be marked for deletion. But since the retention code never gets to run, we don't actually perform that marking.

Since the compactor only skips blocks that actually were marked for deletion, we don't get to skip these would-be deleted blocks.

Analysis

  • The compact command performs compaction here:
    if err := compactor.Compact(ctx); err != nil {
  • It would apply retention here:
    if err := compact.ApplyRetentionPolicyByResolution(ctx, logger, bkt, sy.Metas(), retentionByResolution, compactMetrics.blocksMarked.WithLabelValues(metadata.DeletionMarkFilename, "")); err != nil {
  • However, if we look at the inner loop in the compact call:
    // Loop over bucket and compact until there's no work left.
  • We can see that as long as there are blocks that can be compacted, shouldRerunGroup will be true:
    shouldRerunGroup, _, err := g.Compact(workCtx, c.compactDir, c.planner, c.comp)
  • And thus we never break out of the loop until the entire backlog is processed:

    thanos/pkg/compact/compact.go

    Lines 1351 to 1353 in 09ddcc2

    if finishedAllGroups {
    break
    }

Proposal

This is a minimal change that moves the retention call before the compaction one. That ensures that at least upon compactor process restart, we will actually mark old blocks for deletion.

We'll then still go into the same loop, so no additional retention will be applied until either the backlog clears or the process is restarted.

This is not perfect. Some more concurrency or interleaving of compaction + retention would be desirable. But it is already quite a big improvement over the status quo.

Changes

  • Compact: Apply retention before compaction.

Verification

TBD.

Signed-off-by: Igor Wiedler <iwiedler@gitlab.com>
@igorwwwwwwwwwwwwwwwwwwww igorwwwwwwwwwwwwwwwwwwww force-pushed the compact-retention-before-compaction branch from 60c5772 to 15aae97 Compare October 6, 2022 12:15
return errors.Wrap(err, "sync before retention")
}

if err := compact.ApplyRetentionPolicyByResolution(ctx, logger, bkt, sy.Metas(), retentionByResolution, compactMetrics.blocksMarked.WithLabelValues(metadata.DeletionMarkFilename, "")); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern I have is that some raw blocks might beyond the retention. But they are able to be downsampled to 5m blocks that still within the 5m block retention period.
Changing the order makes users configure a longer retention for 5m blocks to get those downsampled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yeya24 Yeah, this makes sense. So basically for this to be safe, ApplyRetentionPolicyByResolution would need to also check if we have downsampled a block already. And only apply retention if it's been downsampled.

However, this defeats the point (with the current implementation), as downsampling only occurs after compaction. So we are where we started again, unless we also perform downsampling before compaction, and have a check in the downsampler that ensures we are at the right compaction level for downsampling.

Copy link
Contributor

@yeya24 yeya24 Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving retention to the beginning could also mistakenly deleted blocks that are expected to be compacted. Since currently retention calculation uses the block's max time, so after a compaction we may keep the block instead of deleting the small blocks.
Corretness wise I don't think this is safe to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could handle this case by simulating all compaction + downsampling operations, and if the resulting block would be outside of retention, mark it for deletion immediately.

The specific case that this would help with: Lowering retention to purge the backlog. It would come at the cost of additional complexity though, so I'm not sure it's worth it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is for purging the backlog we have the bucket tool retention already. Does that satisfy your usecase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, TIL about bucket tool retention. Though I suspect it suffers from the same issue we discussed here: it does not take into account downsampling, so we run the risk of purging blocks that have not yet been downsampled.

In other words, using that tool is unsafe for the same reason that this proposed patch is unsafe.

Copy link
Contributor

@yeya24 yeya24 Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it is the same as this patch so you use it at your own risk.
If you agree it is the same then do we still need this patch? If you have other things to improve we can discuss on the issue #1708

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think we can close this. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants